Direct S3 Data Access


Timing:

  • Exercise: 20 minutes

Summary

In the previous exercises we searched for and discovered cloud data assets that met certain search criteria (i.e., intersecting our region of interest and within a specified date range). The end goal was to find and save web links to the data assets we want to use in our workflow. The links we found allow us to download data via HTTPS (Hypertext Transfer Protocol Secure). However, NASA also allows direct in-region S3 bucket access for the same assets. In addition to saving the HTTPS links, we also created and saved the S3 links for those same cloud assets, and we will use them here. In this exercise we will demonstrate how to perform direct in-region S3 bucket access for Harmonized Landsat Sentinel-2 (HLS) cloud data assets.

Direct S3 Access

NASA Earthdata Cloud provides two pathways for accessing data from the cloud. The first is via HTTPS; the other is through direct S3 bucket access. Below are some benefits and considerations when choosing direct S3 bucket access for NASA cloud assets.

Benefits

  • Retrieve data very quickly
  • No need to download data! Work with data in a more efficient manner
  • Increased capacity to do parallel processing
  • You are working completely with the AWS cloud ecosystem and thus have access to the might of all AWS offerings (e.g., infrastructure, S3 API, services, etc.)

Considerations

  • Access only works within AWS us-west-2 region
  • Need an AWS S3 “token” to access S3 Bucket
  • Token expires after 1 hour
  • Token only works at the DAAC that generated it
  • S3 on its own does not solve ‘cloud’ problems, but it is one key technology in solving big data problems
  • You still have to load data into memory and parallelize the computation when working with really large data volumes. Many tools enable this, but they are not discussed in this tutorial

Objective

  • Configure our notebook environment and retrieve temporary S3 credentials for in-region direct S3 bucket access
  • Access a single HLS file
  • Access and clip an HLS file to a region of interest
  • Create an HLS time series data array

Let’s get started!


Import Required Packages

%matplotlib inline
import matplotlib.pyplot as plt
from datetime import datetime
import os
import subprocess
import requests
import boto3
import pandas as pd
import numpy as np
import xarray as xr
import rasterio as rio
from rasterio.session import AWSSession
from rasterio.plot import show
import rioxarray
import geopandas
import pyproj
from pyproj import Proj
from shapely.ops import transform
import geoviews as gv
from cartopy import crs
import hvplot.xarray
import holoviews as hv
gv.extension('bokeh', 'matplotlib')

Configure Local Environment and Get Temporary Credentials

To perform direct S3 data access one needs to acquire temporary S3 credentials. The credentials give users direct access to S3 buckets in NASA Earthdata Cloud. AWS credentials should not be shared, so take precautions when using them in notebooks or scripts. Note, these temporary credentials are valid for only 1 hour. For more information regarding the temporary credentials visit https://data.lpdaac.earthdatacloud.nasa.gov/s3credentialsREADME. A netrc file is required to acquire these credentials. Follow the NASA Earthdata Authentication instructions to create a netrc file in your home directory.
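As a quick sanity check, the stdlib `netrc` module can parse a netrc file and confirm that an Earthdata Login entry exists. A minimal sketch (the credentials below are placeholders, and we parse a temporary file rather than your real `~/.netrc`):

```python
import netrc
import os
import tempfile

# A minimal .netrc entry for Earthdata Login (placeholder credentials)
content = 'machine urs.earthdata.nasa.gov login myuser password mypass\n'

with tempfile.NamedTemporaryFile('w', suffix='.netrc', delete=False) as f:
    f.write(content)
    path = f.name

# netrc.netrc() accepts an explicit path; authenticators() returns a
# (login, account, password) tuple for the requested machine
login, _, password = netrc.netrc(path).authenticators('urs.earthdata.nasa.gov')
print(login, password)   # myuser mypass
os.remove(path)
```

Calling `netrc.netrc()` with no argument parses `~/.netrc`; if `authenticators('urs.earthdata.nasa.gov')` returns `None` there, the entry needed for the credentials request below is missing.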

s3_cred_endpoint = 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials'
def get_temp_creds():
    temp_creds_url = s3_cred_endpoint
    return requests.get(temp_creds_url).json()
temp_creds_req = get_temp_creds()
#temp_creds_req                      # !!! BEWARE, removing the # on this line will print your temporary S3 credentials.
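Because the credentials expire after an hour, long-running notebooks need to refresh them. The endpoint's JSON response typically carries an expiration timestamp; the field name (`expiration`) and its ISO-8601 format used below are assumptions, so check your endpoint's actual response. A small helper sketch:

```python
from datetime import datetime, timedelta, timezone

def creds_expired(creds, skew_seconds=60):
    """Return True if temporary S3 credentials are expired or about to be.

    Assumes `creds` carries an ISO-8601 'expiration' field, e.g.
    '2021-09-09 21:43:34+00:00' (an assumption -- verify against the
    actual response from the credentials endpoint).
    """
    exp = datetime.fromisoformat(creds['expiration'])
    return exp - timedelta(seconds=skew_seconds) <= datetime.now(timezone.utc)

# Placeholder credentials expiring an hour from now -> not expired yet
future = (datetime.now(timezone.utc) + timedelta(hours=1)).isoformat(sep=' ')
print(creds_expired({'expiration': future}))   # False
```

If the check returns True, call `get_temp_creds()` again and rebuild the boto3 session with the fresh values.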

Insert the credentials into our boto3 session and configure our rasterio environment for data access

Create a boto3 Session object using your temporary credentials. This Session is used to pass credentials and configuration to AWS so we can interact with S3 objects in the applicable buckets.

session = boto3.Session(aws_access_key_id=temp_creds_req['accessKeyId'], 
                        aws_secret_access_key=temp_creds_req['secretAccessKey'],
                        aws_session_token=temp_creds_req['sessionToken'],
                        region_name='us-west-2')

For this exercise, we are going to open up a context manager for the notebook using the rasterio.env module to store the required GDAL and AWS configurations we need to access the data in Earthdata Cloud. While the context manager is open (rio_env.__enter__()) we will be able to run the open or get data commands that would typically be executed within a with statement, thus allowing us to more freely interact with the data. We’ll close the context (rio_env.__exit__()) at the end of the notebook.

GDAL environment variables must be configured to access Earthdata Cloud data assets. Geospatial data access Python packages like rasterio and rioxarray depend on GDAL, leveraging GDAL’s “Virtual File Systems” to read remote files. GDAL has many environment variables that control its behavior. Changing these settings can mean the difference between being able to access a file and not, and they can also have a significant impact on performance.

rio_env = rio.Env(AWSSession(session),
                  GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
                  GDAL_HTTP_COOKIEFILE=os.path.expanduser('~/cookies.txt'),
                  GDAL_HTTP_COOKIEJAR=os.path.expanduser('~/cookies.txt'))
rio_env.__enter__()
<rasterio.env.Env at 0x7fe992047490>

Read in a single HLS file

We’ll access the HLS S3 object using the rioxarray Python package. The package is an extension of xarray and rasterio, allowing users to read in and interact with geospatial data using xarray data structures. We will also be leveraging the tight integration between xarray and dask to lazily read in data via the chunks parameter. This allows us to connect to the HLS S3 object and read only metadata, not loading the data into memory until we request it via the load() function.

hls_da = rioxarray.open_rasterio(s3_link, chunks=True)   # s3_link: one of the S3 URLs saved in the previous exercise
hls_da
<xarray.DataArray (band: 1, y: 3660, x: 3660)>
[13395600 values with dtype=int16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.097e+05
  * y            (y) float64 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:    -9999.0
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red

When GeoTIFFs/Cloud Optimized GeoTIFFs are read in, a band coordinate variable is automatically created (see the printout above). In this exercise we will not use that coordinate variable, so we will remove it using the squeeze() function to avoid confusion.

hls_da = hls_da.squeeze('band', drop=True)
hls_da
<xarray.DataArray (y: 3660, x: 3660)>
[13395600 values with dtype=int16]
Coordinates:
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.097e+05
  * y            (y) float64 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:    -9999.0
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red

Plot the HLS S3 object

hls_da.hvplot.image(x='x', y='y', cmap='fire', rasterize=True, width=800, height=600, colorbar=True)    # colormaps -> https://holoviews.org/user_guide/Colormaps.html

We can print out the data values as a numpy array by accessing .values

hls_da.values
array([[-9999, -9999, -9999, ...,  1527,  1440,  1412],
       [-9999, -9999, -9999, ...,  1493,  1476,  1407],
       [-9999, -9999, -9999, ...,  1466,  1438,  1359],
       ...,
       [-9999, -9999, -9999, ...,  1213,  1295,  1159],
       [-9999, -9999, -9999, ...,  1042,  1232,  1185],
       [-9999, -9999, -9999, ...,   954,  1127,  1133]], dtype=int16)

Up to this point, we have not loaded anything but metadata into memory. To load the data into memory we can call the .load() function.

hls_da_data = hls_da.load()
hls_da_data
<xarray.DataArray (y: 3660, x: 3660)>
array([[-9999, -9999, -9999, ...,  1527,  1440,  1412],
       [-9999, -9999, -9999, ...,  1493,  1476,  1407],
       [-9999, -9999, -9999, ...,  1466,  1438,  1359],
       ...,
       [-9999, -9999, -9999, ...,  1213,  1295,  1159],
       [-9999, -9999, -9999, ...,  1042,  1232,  1185],
       [-9999, -9999, -9999, ...,   954,  1127,  1133]], dtype=int16)
Coordinates:
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.097e+05
  * y            (y) float64 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:    -9999.0
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red
del(hls_da_data)

Read in and clip a single HLS file

To clip the HLS file, our feature representing our region of interest must be in the same coordinate reference system (CRS) or projection coordinate system as the HLS file. The map projection for our HLS file is Universal Transverse Mercator (UTM) zone 13N. Our feature is mapped to WGS84 geographic coordinate system grid space. We need to transform the geographic coordinate reference system (CRS) of our feature to the UTM projected coordinate system (i.e., UTM Zone 13N)

Read in our geojson file and transform its CRS

field = geopandas.read_file('./data/ne_w_agfields.geojson')

Let’s take a look at the bounding coordinate values.

field_shape = field.geometry[0]
field_shape.bounds
(-101.67271614074707,
 41.04754380304359,
 -101.65344715118408,
 41.06213891056728)

Note, the values above are in decimal degrees and represent the longitude and latitude for the lower left corner (-101.67271614074707, 41.04754380304359) and upper right corner (-101.65344715118408, 41.06213891056728) respectively.

Get the projection information from the HLS file

hls_proj = hls_da.rio.crs
hls_proj
CRS.from_epsg(32613)

Transform coordinates from lat lon (units = dd) to UTM (units = m)

geo_CRS = Proj('+proj=longlat +datum=WGS84 +no_defs', preserve_units=True)   # Source coordinate system of the ROI
project = pyproj.Transformer.from_proj(geo_CRS, hls_proj)                    # Set up the transformation
fsUTM = transform(project.transform, field_shape)
fsUTM.bounds
(779588.4994601272, 4549370.366049466, 781270.1479326887, 4551052.979639321)

The coordinates for our feature have now been converted to UTM Zone 13N, where meters are the designated unit. Note the difference in the values between field_shape.bounds (in geographic coordinates) and fsUTM.bounds (in UTM projection).
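An equivalent way to set up the same transformation is `Transformer.from_crs` with `always_xy=True`, which forces (longitude, latitude) input order regardless of each CRS's axis convention. A sketch, transforming the ROI's lower-left corner:

```python
from pyproj import Transformer

# EPSG:4326 (lon/lat) -> EPSG:32613 (UTM Zone 13N); always_xy=True
# guarantees (lon, lat) in and (easting, northing) out
to_utm = Transformer.from_crs('EPSG:4326', 'EPSG:32613', always_xy=True)
x, y = to_utm.transform(-101.67271614074707, 41.04754380304359)
print(x, y)
```

The result should agree with the lower-left corner of fsUTM.bounds, apart from the small offset that grid convergence introduces between a lat/lon-aligned bounding corner and the reprojected polygon's extremes.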

Now we can clip our HLS file to our region of interest!

Access and clip the HLS file

We can now use our transformed ROI bounding box to clip the HLS S3 object we accessed before. We’ll use the `rio.clip` function.

hls_da_clip = rioxarray.open_rasterio(s3_link, chunks=True).squeeze('band', drop=True).rio.clip([fsUTM])
hls_da_clip
<xarray.DataArray (y: 56, x: 56)>
dask.array<astype, shape=(56, 56), dtype=int16, chunksize=(56, 56), chunktype=numpy.ndarray>
Coordinates:
  * y            (y) float64 4.551e+06 4.551e+06 ... 4.549e+06 4.549e+06
  * x            (x) float64 7.796e+05 7.796e+05 ... 7.812e+05 7.812e+05
    spatial_ref  int64 0
Attributes:
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red
    _FillValue:    -9999
hls_da_clip.hvplot.image(x = 'x', y = 'y', crs = 'EPSG:32613', cmap='fire', rasterize=True, width=800, height=600, colorbar=True)

Read in and clip an HLS time series

Now we’ll read in multiple HLS S3 objects as a time series xarray. Let’s print the links list again to see what we’re working with.

s3_links
['s3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021133T172406.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021133T173859.v1.5.B04.tif',
 's3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021140T173021.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021140T172859.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021145T172901.v1.5.B04.tif',
 's3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021156T173029.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021163T173909.v1.5.B04.tif',
 's3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021165T172422.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021165T172901.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021185T172901.v1.5.B04.tif',
 's3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021188T173037.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021190T172859.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021198T173911.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021200T172859.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021203T173909.v1.5.B04.tif',
 's3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021204T173042.v1.5.B04.tif',
 's3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2021215T172901.v1.5.B04.tif',
 's3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021220T173049.v1.5.B04.tif',
 's3://lp-prod-protected/HLSL30.015/HLS.L30.T13TGF.2021229T172441.v1.5.B04.tif']

Currently, the utilities and packages used in Python to read in GeoTIFF/COG files do not recognize associated dates stored in the internal metadata. To account for the dates of the files we must create a time variable and add it as a dimension in our final time series xarray. We’ll create a function that extracts the date from each file link, then create an xarray variable holding a time array of datetime objects.

def time_index_from_filenames(file_links):
    '''
    Helper function to create a pandas DatetimeIndex
    '''
    return [datetime.strptime(f.split('.')[-5], '%Y%jT%H%M%S') for f in file_links]
time = xr.Variable('time', time_index_from_filenames(s3_links))
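To see exactly what the helper extracts, here is the parse applied to a single filename from the list above:

```python
from datetime import datetime

fname = 'HLS.L30.T13TGF.2021133T172406.v1.5.B04.tif'

# The 5th-from-last dot-separated field is the sensing time,
# encoded as year + day-of-year + time: YYYYDDDTHHMMSS
stamp = fname.split('.')[-5]            # '2021133T172406'
dt = datetime.strptime(stamp, '%Y%jT%H%M%S')
print(dt)                               # 2021-05-13 17:24:06
```

The `%j` directive interprets 133 as the 133rd day of 2021, i.e. May 13, matching the first entry of the time coordinate printed below.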

We’ll now specify a chunk size that matches the internal tiling of HLS files. This will help improve read performance.

chunks=dict(band=1, x=1024, y=1024)

Now, we will create our time series.

hls_ts_da = xr.concat([rioxarray.open_rasterio(f, chunks=chunks).squeeze('band', drop=True) for f in s3_links], dim=time)
hls_ts_da
<xarray.DataArray (time: 19, y: 3660, x: 3660)>
dask.array<concatenate, shape=(19, 3660, 3660), dtype=int16, chunksize=(1, 1024, 1024), chunktype=numpy.ndarray>
Coordinates:
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.097e+05
  * y            (y) float64 4.6e+06 4.6e+06 4.6e+06 ... 4.49e+06 4.49e+06
    spatial_ref  int64 0
  * time         (time) datetime64[ns] 2021-05-13T17:24:06 ... 2021-08-17T17:...
Attributes:
    _FillValue:    -9999.0
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red

Since we used the chunks parameter while reading the data, the hls_ts_da object is not yet read into memory. To load the data into memory we’ll use the load() function. But before that, we’ll clip the hls_ts_da object to our ROI using our transformed ROI coordinates.

hls_ts_da_clip = hls_ts_da.rio.clip([fsUTM]).load()
hls_ts_da_clip
<xarray.DataArray (time: 19, y: 56, x: 56)>
array([[[-9999, -9999, -9999, ...,   980, -9999, -9999],
        [-9999, -9999, -9999, ...,   287, -9999, -9999],
        [ 1573,  1692,  1708, ...,   410, -9999, -9999],
        ...,
        [-9999, -9999,  1165, ...,  1808,  1869,  1906],
        [-9999, -9999,   989, ..., -9999, -9999, -9999],
        [-9999, -9999,  1085, ..., -9999, -9999, -9999]],

       [[-9999, -9999, -9999, ...,   860, -9999, -9999],
        [-9999, -9999, -9999, ...,   496, -9999, -9999],
        [ 2681,  2773,  2496, ...,   550, -9999, -9999],
        ...,
        [-9999, -9999,  3847, ...,  1997,  1914,  1831],
        [-9999, -9999,  4062, ..., -9999, -9999, -9999],
        [-9999, -9999,  4313, ..., -9999, -9999, -9999]],

       [[-9999, -9999, -9999, ...,   808, -9999, -9999],
        [-9999, -9999, -9999, ...,   230, -9999, -9999],
        [ 1802,  1828,  1863, ...,   306, -9999, -9999],
        ...,
...
        ...,
        [-9999, -9999,  1124, ...,   804,   934,  1008],
        [-9999, -9999,  1003, ..., -9999, -9999, -9999],
        [-9999, -9999,   904, ..., -9999, -9999, -9999]],

       [[-9999, -9999, -9999, ...,  1313, -9999, -9999],
        [-9999, -9999, -9999, ...,  1327, -9999, -9999],
        [ 1091,  1094,  1179, ...,  1223, -9999, -9999],
        ...,
        [-9999, -9999,  1145, ...,  1005,  1097,  1197],
        [-9999, -9999,  1037, ..., -9999, -9999, -9999],
        [-9999, -9999,  1114, ..., -9999, -9999, -9999]],

       [[-9999, -9999, -9999, ...,  1272, -9999, -9999],
        [-9999, -9999, -9999, ...,  1231, -9999, -9999],
        [ 1086,  1105,  1193, ...,  1205, -9999, -9999],
        ...,
        [-9999, -9999,  1045, ...,  1049,  1142,  1219],
        [-9999, -9999,   926, ..., -9999, -9999, -9999],
        [-9999, -9999,  1076, ..., -9999, -9999, -9999]]], dtype=int16)
Coordinates:
  * y            (y) float64 4.551e+06 4.551e+06 ... 4.549e+06 4.549e+06
  * x            (x) float64 7.796e+05 7.796e+05 ... 7.812e+05 7.812e+05
  * time         (time) datetime64[ns] 2021-05-13T17:24:06 ... 2021-08-17T17:...
    spatial_ref  int64 0
Attributes:
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red
    _FillValue:    -9999
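A typical next step with a clipped time series like this is to reduce each date to a single statistic, e.g. the mean red reflectance over the field, after masking fill values and applying the scale factor. A self-contained sketch on a small synthetic cube shaped like hls_ts_da_clip (time, y, x):

```python
import numpy as np
import xarray as xr

fill, scale = -9999, 0.0001

# Synthetic stand-in for hls_ts_da_clip: 3 dates over a 2x2 field
cube = xr.DataArray(
    np.array([[[fill, 1000], [1200, 1400]],
              [[fill, 1100], [1300, fill]],
              [[900, 1000], [1100, 1200]]], dtype=np.int16),
    dims=('time', 'y', 'x'))

# Mask fill values (the NaN-aware mean then ignores them), scale to
# reflectance, and average over the spatial dimensions: one value per date
mean_refl = (cube.where(cube != fill) * scale).mean(dim=('y', 'x'))
print(mean_refl.values)   # approximately [0.12, 0.12, 0.105]
```

Applied to the real array, `(hls_ts_da_clip.where(hls_ts_da_clip != -9999) * 0.0001).mean(dim=('y', 'x'))` would give a per-date reflectance curve ready to plot against the time coordinate.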

Now let’s see what we have. We’ll use hvplot to plot the clipped time series.

hls_ts_da_clip.hvplot.image(x='x', y='y', width=800, height=600, colorbar=True, cmap='fire').opts(clim=(hls_ts_da_clip.values.min(), hls_ts_da_clip.values.max()))
# Exit our context
rio_env.__exit__()

Resources